11 research outputs found

    teaMPI—replication-based resiliency without the (performance) pain

    In an era where we cannot afford to checkpoint frequently, replication is a generic way forward to construct numerical simulations that can continue to run even if hardware parts fail. Yet replication is often not employed at larger scales, as naïvely mirroring a computation once effectively halves the machine size, and as keeping replicated simulations consistent with each other is not trivial. We demonstrate for the ExaHyPE engine—a task-based solver for hyperbolic equation systems—that it is possible to realise resiliency without major code changes on the user side, while we introduce a novel algorithmic idea where replication reduces the time-to-solution: the redundant CPU cycles are not burned “for nothing”. Our work employs a weakly consistent data model in which replicas run independently yet inform each other through heartbeat messages that they are still up and running. Our key performance idea is to let the tasks of the replicated simulations share some of their outcomes, while we shuffle the actual task execution order per replica. This way, replicated ranks can skip some local computations and automatically start to synchronise with each other. Our experiments with a production-level seismic wave-equation solver provide evidence that this novel concept has the potential to make replication affordable for large-scale simulations in high-performance computing.
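    The task-sharing idea from the abstract can be illustrated with a toy sketch. This is not teaMPI's actual API; the function and variable names are invented for illustration. Two replica teams walk the same task set in different shuffled orders and publish finished outcomes, so each team can skip any task the other has already delivered.

```python
import random

def run_replicas(task_ids, compute, seed_a=1, seed_b=2):
    """Toy sketch (not the real teaMPI interface) of outcome sharing
    between two replica teams with shuffled task execution orders."""
    orders = []
    for seed in (seed_a, seed_b):
        order = list(task_ids)
        random.Random(seed).shuffle(order)   # per-replica execution order
        orders.append(order)
    shared = {}                              # outcomes exchanged between teams
    computed = [0, 0]                        # tasks actually computed per team
    for step in range(len(orders[0])):
        for team in (0, 1):
            tid = orders[team][step]
            if tid in shared:                # other team already delivered it
                continue                     # -> skip the local computation
            shared[tid] = compute(tid)
            computed[team] += 1
    return shared, computed
```

    Because the orders differ, each task is computed by whichever team reaches it first, so the total work across both teams equals one copy of the simulation rather than two.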

    ExaHyPE: An engine for parallel dynamically adaptive simulations of wave problems

    ExaHyPE (“An Exascale Hyperbolic PDE Engine”) is a software engine for solving systems of first-order hyperbolic partial differential equations (PDEs). Hyperbolic PDEs are typically derived from the conservation laws of physics and are useful in a wide range of application areas. Applications powered by ExaHyPE can be run on a student’s laptop, but are also able to exploit thousands of processor cores on state-of-the-art supercomputers. The engine can dynamically increase the accuracy of the simulation using adaptive mesh refinement where required. Due to the robustness and shock-capturing abilities of ExaHyPE’s numerical methods, users of the engine can simulate linear and non-linear hyperbolic PDEs with very high accuracy. Users tailor the engine to their particular PDE by specifying the evolved quantities, fluxes, and source terms. A complete simulation code for a new hyperbolic PDE can often be realised within a few hours — a task that traditionally takes weeks, months, or even years for researchers starting from scratch. In this paper, we showcase ExaHyPE’s workflow and capabilities through real-world scenarios from our two main application areas: seismology and astrophysics.
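    The “user supplies the flux” workflow can be sketched with a minimal 1-D finite-volume solver for u_t + f(u)_x = 0. This is not ExaHyPE's actual interface (which also provides ADER-DG discretisation, AMR and parallelism); it only illustrates how a solver engine can be tailored through user-specified ingredients. It uses the Rusanov (local Lax-Friedrichs) numerical flux with periodic boundaries.

```python
import math

def solve_fv(u0, flux, max_speed, dx, dt, steps):
    """Minimal 1-D finite-volume sketch for u_t + f(u)_x = 0, with the
    flux and maximal signal speed supplied by the user (illustrative
    only; NOT ExaHyPE's real API)."""
    u = list(u0)
    n = len(u)
    for _ in range(steps):
        # Rusanov flux at each interface i+1/2 (periodic wrap-around).
        f = [0.5 * (flux(u[i]) + flux(u[(i + 1) % n]))
             - 0.5 * max_speed * (u[(i + 1) % n] - u[i]) for i in range(n)]
        # Conservative update: u_i -= dt/dx * (F_{i+1/2} - F_{i-1/2}).
        u = [u[i] - dt / dx * (f[i] - f[i - 1]) for i in range(n)]
    return u

# Linear advection u_t + a u_x = 0: the "user code" is just the flux
# a*q and the signal speed a.
a = 1.0
n, dx = 64, 1.0 / 64
u0 = [math.sin(2 * math.pi * i * dx) for i in range(n)]
u = solve_fv(u0, flux=lambda q: a * q, max_speed=a, dx=dx, dt=0.5 * dx, steps=10)
```

    Swapping in a different flux function (e.g. Burgers' flux `0.5*q*q`) changes the PDE being solved without touching the solver loop, which is the division of labour the engine concept relies on.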

    The EU Center of Excellence for Exascale in Solid Earth (ChEESE): Implementation, results, and roadmap for the second phase


    Towards a deeper understanding of hybrid programming

    With the end of Dennard scaling, future high-performance computers are expected to consist of distributed nodes whose cores have direct access to shared memory within a node. However, many parallel applications still use a pure message-passing programming model based on the Message Passing Interface (MPI), and thereby potentially fail to make optimal use of shared-memory resources. The pure message-passing approach—as argued in this work—is not necessarily the best fit for current and future supercomputing architectures. In this thesis, I therefore present a detailed performance analysis of so-called hybrid programming models, which aim to improve performance by combining a shared-memory model with the message-passing model on current symmetric multiprocessor (SMP) systems. First, inter-node communication performance is investigated in the context of (hybrid) message-passing programs. A novel performance model for estimating communication performance on current SMP nodes is presented. In contrast to the commonly used classic postal performance model, the new model predicts inter-node communication performance more accurately in the presence of simultaneously communicating processes and saturation of the network interface controller on current multicore architectures. The implications of the new model for hybrid programs are discussed. In addition, I demonstrate the (current) difficulties of multithreaded MPI communication based on results obtained for a multithreaded ping-pong benchmark. Moreover, I show how intra-node MPI communication performance can be improved significantly for small- to medium-sized messages by saving message-passing overhead and/or through superior cache usage. This is achieved by a direct copy in shared memory using either the hybrid MPI+MPI or the MPI+OpenMP programming method. 
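    The saturation effect motivating the new model can be sketched in a few lines. The thesis' actual model is not reproduced in the abstract, so the second function below is purely illustrative, with invented names: it shows why the classic postal model, which ignores how many processes share the network interface, underestimates transfer time once the NIC saturates.

```python
def postal_time(n, alpha, beta):
    """Classic postal (Hockney) model: latency alpha plus per-byte
    cost beta, independent of how many processes communicate."""
    return alpha + n * beta

def saturated_time(n, alpha, beta, procs, beta_nic):
    """Illustrative sketch only (NOT the thesis' actual model): with
    `procs` processes communicating simultaneously, the shared NIC's
    aggregate bandwidth caps the per-process rate, so the effective
    per-byte cost grows once procs * beta_nic exceeds beta."""
    return alpha + n * max(beta, procs * beta_nic)
```

    With one communicating process the two models agree; with many, the saturated estimate grows linearly in the process count while the postal estimate stays flat, which is the mismatch the thesis' measurements expose.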
Furthermore, I contrast and evaluate in depth several (pure and hybrid) implementation options for a structured-grid sparse matrix-vector multiplication. These choices differ in how hybrid parallelism is exploited at the application level (coarse-grained vs. fine-grained problem decomposition) and in the hybrid programming systems used (pure MPI vs. MPI+MPI vs. MPI+OpenMP). I discuss their performance factors, such as locality, overhead, efficient use of MPI's derived datatypes, and the serial fraction in Amdahl's law. Moreover, I experimentally demonstrate how a coarse-grained hybrid application design can be used to control these factors, resulting in significant performance improvements (compared to a pure MPI parallelization) in communication and/or synchronization for both the hybrid MPI+MPI and MPI+OpenMP parallel programming approaches for different grid decompositions.
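    The kernel under study can be made concrete with a serial sketch: a structured-grid sparse matrix-vector product is a stencil application, here the 2-D 5-point Laplacian applied matrix-free. The function name is illustrative. In the hybrid variants compared above, the outer loop would be split across MPI ranks (coarse-grained) or OpenMP threads (fine-grained); here it is serial.

```python
def stencil_matvec(u, nx, ny):
    """Structured-grid sparse matrix-vector product sketch: apply the
    2-D 5-point Laplacian (Dirichlet-style boundary: missing
    neighbours contribute nothing) without storing the matrix."""
    v = [0.0] * (nx * ny)
    for j in range(ny):
        for i in range(nx):
            k = j * nx + i
            s = 4.0 * u[k]               # diagonal entry
            if i > 0:      s -= u[k - 1]   # west neighbour
            if i < nx - 1: s -= u[k + 1]   # east neighbour
            if j > 0:      s -= u[k - nx]  # south neighbour
            if j < ny - 1: s -= u[k + nx]  # north neighbour
            v[k] = s
    return v
```

    The coarse-grained vs. fine-grained distinction then amounts to whether each parallel worker owns a contiguous subgrid (with halo exchange at the borders) or merely a slice of the j-loop iterations.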

    Doubt and Redundancy Kill Soft Errors—Towards Detection and Correction of Silent Data Corruption in Task-based Numerical Software

    Resilient algorithms in high-performance computing are subject to rigorous non-functional constraints: resiliency must not increase the runtime, memory footprint or I/O demands too significantly. We propose a task-based soft-error detection scheme that relies on error criteria per task outcome. These criteria formalise how “dubious” an outcome is, i.e. how likely it is to contain an error. The whole simulation is replicated once, forming two teams of MPI ranks that share their task results; ideally, each team thus handles only around half of the workload. If a task yields large error-criterion values, i.e. is dubious, we compute the task redundantly and compare the outcomes. Whenever they disagree, the task result with the lower error likelihood is accepted. We obtain a self-healing, resilient algorithm that can compensate for silent floating-point errors without a significant performance, I/O or memory-footprint penalty. Case studies, however, suggest that careful, domain-specific tailoring of the error criteria remains essential.
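    The doubt-and-redundancy decision can be sketched per task. All names below are illustrative, not the paper's implementation: a plausible outcome is accepted immediately, a dubious one triggers redundant computation, and on disagreement the outcome with the lower error-criterion value wins.

```python
def accept_outcome(task, compute, error_criterion, threshold):
    """Sketch (illustrative names) of per-task soft-error handling:
    recompute only when the error criterion flags the outcome as
    dubious, and on disagreement keep the less error-prone result."""
    first = compute(task)
    if error_criterion(first) <= threshold:
        return first                 # plausible: accept without redundancy
    second = compute(task)           # dubious: compute redundantly
    if first == second:
        return first                 # redundant runs agree
    # Disagreement: accept the result with the lower error likelihood.
    return min(first, second, key=error_criterion)
```

    The scheme's cost model follows directly: only tasks whose outcomes exceed the threshold pay for a second execution, which is why the tailoring of the criteria (and threshold) is decisive for both overhead and detection rate.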

    Lightweight Task Offloading Exploiting MPI Wait Times for Parallel Adaptive Mesh Refinement

    Balancing the workload of sophisticated simulations is inherently difficult, since we have to balance both computational workload and memory footprint over meshes that can change at any time or yield unpredictable cost per mesh entity, while modern supercomputers and their interconnects start to exhibit fluctuating performance. We propose a novel lightweight balancing technique for MPI+X to accompany traditional, prediction-based load balancing. It is a reactive diffusion approach that uses online measurements of MPI idle time to migrate tasks temporarily from overloaded to underemployed ranks. Tasks are deployed to ranks that would otherwise wait, processed with high priority, and made available to the overloaded ranks again; this migration is nonpersistent. Our approach hijacks idle time to do meaningful work and is fully nonblocking, asynchronous and distributed, without a global data view. Tests with a seismic simulation code developed in the ExaHyPE engine demonstrate the method's potential: we found speed-ups of up to 2-3 for ill-balanced scenarios without logical modifications of the code base, and show that the strategy can react quickly to temporarily changing workload or node performance.
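    One diffusion round of the idea above can be sketched as follows. The function and its ring topology are illustrative, not the paper's implementation: ranks that measured MPI idle time pull a small chunk of tasks from their more loaded neighbour, using only local information, and the record of moves allows the results to flow back to the owning rank.

```python
def diffuse(load, idle, chunk=1):
    """One round of a reactive-diffusion sketch (illustrative, not the
    paper's code): each rank with measured idle time temporarily takes
    `chunk` tasks from its more loaded ring neighbour. `load` is
    modified in place; the returned moves record (victim, helper,
    ntasks) so ownership stays with the victim (nonpersistent)."""
    n = len(load)
    moved = []
    for r in range(n):
        if idle[r] <= 0:
            continue                     # rank r saw no MPI wait time
        # Pick the more loaded of the two ring neighbours, locally.
        victim = max((r - 1) % n, (r + 1) % n, key=lambda v: load[v])
        if load[victim] > load[r] + chunk:
            load[victim] -= chunk        # victim sheds work...
            load[r] += chunk             # ...helper processes it at high priority
            moved.append((victim, r, chunk))
    return moved
```

    Because each rank only inspects its neighbours and its own idle measurement, the scheme needs no global data view and no blocking collective, matching the nonblocking, distributed character described above.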
